5 Qualitative Comparative Analysis (QCA) - setup
5.1 Original Data-set Description
The data-set for this analysis comes from an ESGeo draft project for ESG score forecasting. As previously mentioned, it does not come with a codebook for its variables, so the description of its components is essentially a list of the variables it contains.
set of Xs
| Variable Name | Short description |
|---|---|
| country | company country |
| continent | company continent |
| sector | company sector |
| tot_rev_19 | total revenues of the company in 2019 |
| P/E-19 | price-to-earnings ratio in 2019: the company’s price per share compared to its earnings per share. It allows investors to determine whether a stock is over- or undervalued |
| industry_name | TRBC industry name |
| ind_group_name | TRBC industry group name |
| business_sect_name | TRBC business sector name |
| econ_sect_name | TRBC economic sector name |
TRBC stands for The Refinitiv Business Classification. It is a detailed sector and industry classification with five levels of granularity:
1. 13 Economic Sectors
2. 33 Business Sectors
3. 62 Industry Groups
4. 154 Industries
5. 898 Activities
For more information on the TRBC Sector Classification check: https://www.refinitiv.com/en/financial-data/indices/trbc-business-classification
For this analysis, we have chosen to work with the first level of granularity, the Economic Sector. The reason is to avoid overloading the data: when creating the new data-set, each nominal variable has to be rearranged into as many dichotomous variables as it has categories.
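The rearrangement of a nominal variable into dichotomous variables can be sketched on a toy vector. The labels and the names `sectors` and `dummies` below are made up for illustration and are not taken from the project's data.

```r
# Toy nominal variable (illustrative labels, not the real data)
sectors <- c("Energy", "Utilities", "Energy", "Financials")

# One 0/1 column per distinct category
dummies <- sapply(unique(sectors), function(lvl) as.integer(sectors == lvl))

print(dummies)
```

The number of columns equals the number of categories, which is why a coarser classification level keeps the new data-set small.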
set of Ys
| Variable Name | Short description |
|---|---|
| HR_score | Human Resources score |
| ENV_score | Environmental score |
| BUSBEHAV_score | Business Behavior score |
| COMINV_score | Community Involvement score |
| HRts_score | Human Rights score |
| OVERALL_score | Overall ESG score |
These Ys all contribute to the OVERALL_score variable, but the weighting of each component is not available. In this analysis we consider only the Overall ESG score as our Y, since it is the final score of a company's ESG positioning and since this avoids overloading the new data-set we are going to create.
5.2 Data cleaning and management
5.2.1 Libraries
library(tidyverse)
library(dplyr)
library(hrbrthemes)
library(ggplot2)
library(QCA)
library(QCAtools)
library(ggpubr)
library(RcmdrMisc) # assumed source of readXL(), which is used below to load the data
5.2.2 Data-set Cleaning
Load the original data-set named “datawhole”
datawhole <- readXL("datawhole.xlsx", rownames = FALSE, header = TRUE, na = "", sheet = 1,
                    stringsAsFactors = FALSE)
5.2.3 Dropping NAs
The dropping of NAs is carried out in 2 steps:
- the first is straightforward, as shown in the following code:
datawhole <- drop_na(datawhole)
- the second consists of dropping the NAs that are coded in the data-set as “Nonspecified sector” or “Unable to collect data for the field ‘TR.TRBCIndustry’ and some specific identifier(s)”:
datawhole<-datawhole[!(datawhole$ind_group_name=="Nonspecified sector"| datawhole$ind_group_name=="Unable to collect data for the field 'TR.TRBCIndustry' and some specific identifier(s)."),]
datawhole<-datawhole[!(datawhole$business_sect_name=="Nonspecified sector"| datawhole$business_sect_name=="Unable to collect data for the field 'TR.TRBCIndustry' and some specific identifier(s)."),]
datawhole<-datawhole[!(datawhole$econ_sect_name=="Nonspecified sector"| datawhole$econ_sect_name=="Unable to collect data for the field 'TR.TRBCIndustry' and some specific identifier(s)."),]
datawhole<-datawhole[!(datawhole$industry_name=="Nonspecified sector"| datawhole$industry_name=="Unable to collect data for the field 'TR.TRBCIndustry' and some specific identifier(s)."),]
5.2.4 Dropping misclassification
While inspecting the Economic Sector variable, we noticed a misclassification error: the “Telecommunications Services” label is not an Economic Sector according to the TRBC but a Business Sector.
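One way such a label can be spotted is a frequency table of the column. The sketch below uses a made-up vector `econ_sect_toy` in place of datawhole$econ_sect_name, so the counts are purely illustrative.

```r
# Made-up stand-in for datawhole$econ_sect_name
econ_sect_toy <- c("Energy", "Technology", "Telecommunications Services",
                   "Energy", "Utilities")

# Frequency of each label; unexpected categories show up as extra names
label_counts <- table(econ_sect_toy)
print(label_counts)

# Keep only the entries whose label is not the misplaced one
cleaned <- econ_sect_toy[econ_sect_toy != "Telecommunications Services"]
```

The line that follows applies the same kind of filter to the rows of datawhole.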
datawhole<-datawhole[!(datawhole$econ_sect_name=="Telecommunications Services"),]
5.2.5 Creating new variables
Once the NAs have been dropped, we create some new variables to prepare the data-set for the QCA we will later perform:
- the first is the logarithm of 2019 revenues: LOG_REV_19. We do so to manage the great variability in the order of magnitude of the observations due to outliers. An alternative would have been to truncate the outliers and keep only the observations in an intermediate range (which also happen to be the most numerous), but this would have caused us to lose potentially important information.
datawhole['log_rev_19'] <- log(datawhole$tot_rev_19)
- the second variable keeps information about the difference between 2019 and 2018 revenues, to capture the revenue variability of the companies in the sample. The initial idea was to first calculate the logarithm of 2018 revenues and then the delta between the revenues of the 2 years. Unfortunately, this operation would have created NaNs, since the natural logarithm of a negative number is undefined. To bypass this problem, we created a variable named DELTAREV_100: the ratio between 2019 revenues and 2018 revenues, multiplied by 100. In this way the natural 0 is moved to 100: a value of 100 means that the company's revenues did not change between 2018 and 2019, a value above 100 means that revenues increased in 2019, and a value below 100 means that they decreased. The only adjustment forced on the data-set is the dropping of the single observation whose 2018 revenues are 0.
#drop the observation in 2018 revenues that has 0 as value
datawhole<-datawhole[!(datawhole$tot_rev_18==0),]
#create the new variable as the ratio between 2019 and 2018 revenues, multiplied by 100
datawhole['deltarev_100'] <- (datawhole$tot_rev_19 / datawhole$tot_rev_18)*100
- within the variable econ_sect_name, we want to create as many variables as there are sectors in this data-set. In each of these 9 variables, the sector in question is coded as 1 and all the other observations as 0. This is a first step of categorization for the QCA we will later perform.
datawhole['energy_1'] <- ifelse(datawhole$econ_sect_name == "Energy", 1,0)
datawhole['bscmaterials_1'] <- ifelse(datawhole$econ_sect_name == "Basic Materials", 1,0)
datawhole['industrials_1'] <- ifelse(datawhole$econ_sect_name == "Industrials", 1,0)
datawhole['conscycl_1'] <- ifelse(datawhole$econ_sect_name == "Consumer Cyclicals", 1,0)
datawhole['consnoncycl_1'] <- ifelse(datawhole$econ_sect_name == "Consumer NonCyclicals", 1,0)
datawhole['financials_1'] <- ifelse(datawhole$econ_sect_name == "Financials", 1,0)
datawhole['healthcare_1'] <- ifelse(datawhole$econ_sect_name == "Healthcare", 1,0)
datawhole['technology_1'] <- ifelse(datawhole$econ_sect_name == "Technology", 1,0)
datawhole['utilities_1'] <- ifelse(datawhole$econ_sect_name == "Utilities", 1,0)
- within the variable continent, we apply the same rationale as for the economic sectors and create as many variables as there are geographical areas in the variable.
datawhole['asia_1'] <- ifelse(datawhole$continent == "Asia Pacific", 1,0)
datawhole['emergmrkt_1'] <- ifelse(datawhole$continent == "Emerging Markets", 1,0)
datawhole['europe_1'] <- ifelse(datawhole$continent == "Europe", 1,0)
datawhole['mddleastafrica_1'] <- ifelse(datawhole$continent == "Middle East Africa", 1,0)
datawhole['nrtamerica_1'] <- ifelse(datawhole$continent == "North America", 1,0)
5.3 New Data-set Creation: data
At this point, we are ready to create a new data-set containing only the variables that are useful for the analysis: all the newly created variables described above and the y of the data-set, OVERALL_SCORE, the overall ESG score of the companies.
data<- data.frame(datawhole$continent, datawhole$asia_1, datawhole$emergmrkt_1,
datawhole$europe_1, datawhole$mddleastafrica_1, datawhole$nrtamerica_1,
datawhole$log_rev_19, datawhole$deltarev_100,
datawhole$econ_sect_name, datawhole$energy_1,
datawhole$bscmaterials_1, datawhole$industrials_1, datawhole$conscycl_1,
datawhole$consnoncycl_1, datawhole$financials_1, datawhole$healthcare_1,
datawhole$technology_1, datawhole$utilities_1, datawhole$OVERALL_score)
data<- data %>% dplyr::rename(
"CONTINENT" = datawhole.continent,
"ASIA_1" = datawhole.asia_1,
"BRICS_1" = datawhole.emergmrkt_1,
"EUROPE_1" = datawhole.europe_1,
"MDDLEAST_1" = datawhole.mddleastafrica_1,
"NRTAMERICA_1" = datawhole.nrtamerica_1,
"LOG_REV_19" = datawhole.log_rev_19,
"DELTAREV_100" = datawhole.deltarev_100,
"ECON_SEC_NAME" = datawhole.econ_sect_name,
"ENERGY_1" = datawhole.energy_1,
"BSCMATERIALS_1" = datawhole.bscmaterials_1,
"INDUSTRIALS_1" = datawhole.industrials_1,
"CONSCYCL_1" = datawhole.conscycl_1,
"CONSNONCYCL_1" = datawhole.consnoncycl_1,
"FINANCIALS_1" = datawhole.financials_1,
"HEALTHCARE_1" = datawhole.healthcare_1,
"TECHNOLOGY_1" = datawhole.technology_1,
"UTILITIES_1" = datawhole.utilities_1,
"OVERALL_SCORE" = datawhole.OVERALL_score
)
5.3.1 data: Variables codebook
These variables have been created from the ones in the original data-set so that they can specifically serve the Qualitative Comparative Analysis (QCA) that we are going to run next. All the originally qualitative variables, namely the geographical area and the economic sector, have been split into as many dichotomous variables as there are categories in the original variable. In this sense, the geographical area variable in “datawhole” now becomes 5 dichotomous variables in “data”.
Each of these 5 dichotomous geographical area variables presents:
- “1” if the area corresponds to the variable name
- “0” for all the other areas.
The same rationale applies to the economic sector name, now split into 9 dichotomous variables.
Each of these 9 dichotomous economic sector variables presents:
- “1” if the economic sector corresponds to the variable name
- “0” for all the other economic sectors.
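If the category labels were matched correctly, each row should contain exactly one 1 across a complete set of dummies. A minimal consistency check, sketched on a made-up data frame `toy` that borrows three of the sector column names (the values are illustrative only):

```r
# Toy stand-in for three of the sector dummies in `data` (values made up)
toy <- data.frame(
  ENERGY_1     = c(1, 0, 0),
  UTILITIES_1  = c(0, 1, 0),
  TECHNOLOGY_1 = c(0, 0, 1)
)

# Each row should contain exactly one 1 across a complete set of dummies
ones_per_row <- rowSums(toy)
stopifnot(all(ones_per_row == 1))
```

A row summing to 0 would point to a label that none of the comparisons matched, for example because of a spelling difference.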
In the interest of order, both the original variables containing geographical area information and economic sector information, respectively CONTINENT and ECON_SEC_NAME have been preserved in this new data-set but will not be effectively used for the analysis.
Moreover, the new data-set “data” contains:
- the newly created variables for:
  - the 2019 revenues converted into logarithm to manage the great magnitude variability of the observations: LOG_REV_19;
  - the delta between revenues from 2019 and 2018, calculated as the ratio between the revenues multiplied by 100: DELTAREV_100;
- the y of this analysis, unchanged from the original data-set: OVERALL_SCORE.
These last 3 variables are not dichotomous, so they will have to be calibrated during the Qualitative Comparative Analysis.
| Variable Name | Short description |
|---|---|
| CONTINENT | all geographical locations of the companies present in the data-set |
| ASIA_1 | companies located in Asia Pacific |
| BRICS_1 | companies located in the Emerging Markets: Brazil, Russia, India, China, and South Africa |
| EUROPE_1 | companies located in Europe |
| MDDLEAST_1 | companies located in the Middle East and Africa |
| NRTAMERICA_1 | companies located in North America |
| LOG_REV_19 | 2019 revenues converted into logarithm |
| DELTAREV_100 | delta between revenues from 2019 and 2018, calculated as the ratio between the revenues multiplied by 100 |
| ECON_SEC_NAME | all economic sectors of the companies present in the data-set |
| ENERGY_1 | companies part of the Energy sector |
| BSCMATERIALS_1 | companies part of the Basic Materials sector |
| INDUSTRIALS_1 | companies part of the Industrials sector |
| CONSCYCL_1 | companies part of the Consumer Cyclicals sector |
| CONSNONCYCL_1 | companies part of the Consumer Non-Cyclicals sector |
| FINANCIALS_1 | companies part of the Financials sector |
| HEALTHCARE_1 | companies part of the Healthcare sector |
| TECHNOLOGY_1 | companies part of the Technology sector |
| UTILITIES_1 | companies part of the Utilities sector |
| OVERALL_SCORE | Overall ESG score |
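The two derived revenue variables in the codebook can be illustrated on made-up figures. The vectors `rev_18` and `rev_19` below are hypothetical, and the last step assumes that the delta first considered was a plain difference taken before the logarithm.

```r
# Made-up revenues for three hypothetical companies
rev_18 <- c(200, 500, 1000)
rev_19 <- c(250, 500,  800)

# Log of 2019 revenues compresses differences in order of magnitude
log_rev_19 <- log(rev_19)

# Ratio scaled so that 100 means "no change"
deltarev_100 <- (rev_19 / rev_18) * 100
print(deltarev_100)

# Taking the log of the raw difference breaks down when revenues do not grow
log_diff <- suppressWarnings(log(rev_19 - rev_18))
print(log_diff)
```

With these figures the ratio variable is 125, 100 and 80, while the logarithm of the plain difference is finite only for the first company.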
5.3.2 data: Descriptives
The new data-set called data has:
| MIN overall ESG score | MAX overall ESG score | TOT observation number |
|---|---|---|
| 6 | 73 | 3332 |
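A minimal sketch of how such descriptives could be computed. It assumes a data frame with an OVERALL_SCORE column, as built in 5.3; here a made-up stand-in called `toy` is used so that the snippet is self-contained, which means the observation count differs from the table above.

```r
# Made-up stand-in for `data`; only the outcome column is needed here
toy <- data.frame(OVERALL_SCORE = c(6, 41, 73, 28))

# Minimum score, maximum score and number of observations
descriptives <- data.frame(
  min_score = min(toy$OVERALL_SCORE),
  max_score = max(toy$OVERALL_SCORE),
  n_obs     = nrow(toy)
)
print(descriptives)
```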
5.3.3 data: Frequencies Visualizations
The frequency of companies per economic sector in the data-set data can be visualized as follows:
graph_numsector <- data %>%
  ggplot(aes(x = ECON_SEC_NAME, color = ECON_SEC_NAME)) +
  # one bar per sector; the bar height is the number of companies
  geom_bar(fill = "white", alpha = 0.9) +
  ggtitle("Number of companies per sector") +
  scale_y_continuous(breaks = seq(0, 4000, by = 200)) +
  theme_ipsum() +
  theme(
    plot.title = element_text(size = 15)
  )
graph_numsector + xlab("Economic sector name")
The distribution by geographical area is visualized in the same way:
graph_numcontinent <- datawhole %>%
  ggplot(aes(x = continent, color = continent)) +
  # one bar per geographical area; the bar height is the number of companies
  geom_bar(fill = "white", alpha = 0.9) +
  ggtitle("Number of companies per geographical area") +
  scale_y_continuous(breaks = seq(0, 4000, by = 200)) +
  theme_ipsum() +
  theme(
    plot.title = element_text(size = 15)
  )
graph_numcontinent + xlab("Geographical area")